Lecture 10: Confusion matrices and model fit for linear regression

EBA3500 Data analysis with programming

Jonas Moss

BI Norwegian Business School
Department of Data Science and Analytics

Confusion matrices, ROC, and AUC

\(0-1\) predictions in binary regression

  • The predict method gives you a probability.
  • But often you have to make a prediction of \(\hat{Y}\) itself.
  • Will you follow a marketing lead? Do you believe a candidate will get an \(A\) when he drinks 5 cups of coffee?
  • You usually construct predictions of \(Y\) using a threshold: \[\hat{Y}=1 \iff \hat{p} > c,\] where \(\hat{p}\) is a predicted probability obtained using predict and \(\hat{Y}\) is the predicted value of \(Y\).
  • For instance, you could decide to do the action if \(\hat{p} > 0.5\).
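The thresholding rule can be sketched in a couple of lines of numpy. The probabilities below are made up for illustration; in practice \(\hat{p}\) comes from model.predict().

```python
import numpy as np

# Hypothetical predicted probabilities; in practice, p_hat = model.predict().
p_hat = np.array([0.2, 0.55, 0.8, 0.4])
c = 0.5
y_hat = (p_hat > c).astype(int)  # Y-hat = 1 exactly when p-hat > c
print(y_hat)  # [0 1 1 0]
```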

Admission data (again)

import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
admission = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
admission.head()
admit gre gpa rank
0 0 380 3.61 3
1 1 660 3.67 3
2 1 800 4.00 1
3 1 640 3.19 4
4 0 520 2.93 4

model = smf.logit("admit ~ gre + gpa + C(rank)", data = admission).fit()
plt.hist(model.predict())
Optimization terminated successfully.
         Current function value: 0.573147
         Iterations 6

Confusion matrix

The confusion matrix shows true positives, false negatives, false positives, and true negatives.

\(\hat{Y}=1\) \(\hat{Y}=0\)
\(Y=1\) True positive False negative
\(Y=0\) False positive True negative

Confusion matrix for admission with c = 0.5

model.pred_table(threshold = 0.5)
array([[254.,  19.],
       [ 97.,  30.]])
\(\hat{Y}=1\) \(\hat{Y}=0\)
\(Y=1\) 30 97
\(Y=0\) 19 254
  • Question: How many true positives are there?
    • Answer: True positives happen when both \(\hat{Y}=1\) and \(Y=1\). Hence the correct answer is \(30\).
  • Question: How many false negatives are there?
    • Answer: False negatives correspond to \(\hat{Y}=0\) and \(Y=1\). There are 97 of these.

Confusion matrix for admission

model.pred_table(threshold = 0.6)
array([[266.,   7.],
       [114.,  13.]])
\(\hat{Y}=1\) \(\hat{Y}=0\)
\(Y=1\) 13 114
\(Y=0\) 7 266
model.pred_table(threshold = 0.5)
array([[254.,  19.],
       [ 97.,  30.]])
\(\hat{Y}=1\) \(\hat{Y}=0\)
\(Y=1\) 30 97
\(Y=0\) 19 254

Confusion matrix for admission

  • Question: How can you make sure the number of false positives is \(0\)?
    • Answer: Make the threshold \(c=1\)!
model.pred_table(threshold = 1)
array([[273.,   0.],
       [127.,   0.]])

True positive rate and false positive rate

\[ \text{true positive rate}=\frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \] Also known as sensitivity, recall, and hit rate. Abbreviated as tpr.

\[ \text{false positive rate}=\frac{\text{false positives}}{\text{false positives} + \text{true negatives}} \]

Also known as fall-out. Abbreviated as fpr.

\(\hat{Y}=1\) \(\hat{Y}=0\)
\(Y=1\) 30 97
\(Y=0\) 19 254
  • Here \(TPR = \frac{30}{30 + 97} = 0.24\) and \(FPR = \frac{19}{19 + 254} = 0.07\).
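These rates can be read straight off the pred_table output. A small sketch, assuming the orientation shown above (rows are the actual outcomes 0 then 1, columns the predictions 0 then 1):

```python
import numpy as np

# Confusion matrix from model.pred_table(threshold = 0.5);
# rows: actual Y = 0, 1; columns: predicted Y-hat = 0, 1.
table = np.array([[254.0, 19.0], [97.0, 30.0]])
tn, fp = table[0]
fn, tp = table[1]

tpr = tp / (tp + fn)  # 30 / (30 + 97)
fpr = fp / (fp + tn)  # 19 / (19 + 254)
print(round(tpr, 2), round(fpr, 2))  # 0.24 0.07
```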

Receiver operating characteristic curve

The receiver operating characteristic curve. Source: Wikipedia.

How to make these plots

from sklearn import metrics
fpr, tpr, _ = metrics.roc_curve(admission.admit, model.predict())
plt.plot(fpr, tpr)
plt.show()

The area under the curve (i)

The area under the curve (ii)

  • The area under the curve (AUC) is the integral of the receiver operating characteristic curve.
    • \(0.5 \leq AUC \leq 1\).
    • \(AUC = 1\) if we can predict perfectly.
    • \(AUC = 0.5\) if we can’t predict at all.
  • Extremely widespread!
  • Since the AUC doesn’t penalize model complexity it shouldn’t be used to compare models!
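The AUC has a useful interpretation: it equals the probability that a randomly chosen positive observation gets a higher predicted probability than a randomly chosen negative one. A sketch of this on a made-up toy example (for real data you could also use, e.g., sklearn's roc_auc_score):

```python
import numpy as np

# Toy data: true labels and made-up predicted probabilities.
y = np.array([0, 0, 1, 1])
p_hat = np.array([0.1, 0.4, 0.35, 0.8])

# AUC as the probability that a positive outranks a negative (ties count 1/2).
pos, neg = p_hat[y == 1], p_hat[y == 0]
diffs = pos[:, None] - neg[None, :]
auc = np.mean((diffs > 0) + 0.5 * (diffs == 0))
print(auc)  # 0.75: three of the four (positive, negative) pairs are ranked correctly
```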

The \(R^2\) for binary regression models

Remember the setup for the \(R^{2}\) for non-linear regression: \[ 1-\frac{\sum_{i=1}^{n}(y_{i}-f(x_{i};a_{1},\ldots,a_{p}))^{2}}{\sum_{i=1}^{n}(y_{i}-m)^{2}}\] In general we could write \[R^{2}=1-\frac{R(x)}{R(m)},\] where

  • \(R(m)\) measures the distance between our best predictions and the data \(y\) when we ignore \(x\), and
  • \(R(x)\) measures the distance between our best predictions and the data \(y\) when we take \(x\) into account.

The log-likelihood and the \(R^2\)

We use the negative log-likelihood instead of the quadratic loss! So, letting \(\hat{p}_{i} = f(x_{i};a_{1},\ldots,a_{p})\), define the fitted negative log-likelihood \[l(x) = R(x) = \sum_{i=1}^{n}-y_{i}\log \hat{p}_{i}-(1-y_{i})\log(1-\hat{p}_{i}).\]

\[l(m) = R(m) = \sum_{i=1}^{n}-y_{i}\log m-(1-y_{i})\log(1-m),\] is the negative log-likelihood of the null model, where \(m\) is the mean of \(y\). It measures how well we can predict \(y\) when we don’t know any \(x\) at all.
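The null-model loss \(l(m)\) is easy to compute by hand, since the null model always predicts the overall mean of \(y\). A minimal sketch on a made-up outcome vector:

```python
import numpy as np

# Toy outcome vector (made-up data for illustration).
y = np.array([0, 0, 1, 1, 1])
m = y.mean()  # the null model predicts the overall mean, here 0.6

# l(m): negative log-likelihood of the null model.
l_m = np.sum(-y * np.log(m) - (1 - y) * np.log(1 - m))
print(l_m)  # roughly 3.365
```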

McFadden’s \(R^2\)

Now define the pseudo-\(R^2\) or McFadden \(R^2\) as \[R^2_{\textrm{McFadden}} = 1 - \frac{l(x)}{l(m)}.\]

  • Since McFadden’s \(R^2\) doesn’t penalize model complexity it shouldn’t be used to compare models!

An example

probit_fit = smf.probit("admit ~ gre + C(rank) + gpa", data = admission).fit()
logit_fit = smf.logit("admit ~ gre + C(rank) + gpa", data = admission).fit()

def rsq_mcfadden(fit):
    lower = fit.llnull
    upper = fit.llf
    return 1 - upper / lower

print(rsq_mcfadden(probit_fit), rsq_mcfadden(logit_fit))
Optimization terminated successfully.
         Current function value: 0.573016
         Iterations 5
Optimization terminated successfully.
         Current function value: 0.573147
         Iterations 6
0.08313059666890721 0.08292194470084713
  • Thus the \(R^2\)s are \(0.0831\) and \(0.0829\). But we may also access McFadden’s \(R^2\) directly: print(probit_fit.prsquared, logit_fit.prsquared).

  • Question: Can you compare these \(R^2\)s to the \(R^2\)s from linear regression?

    • Answer: No, as we are looking at the likelihoods now, not the quadratic distance. Also, the “ordinary” \(R^2\) tends to be small for binary data.

McFadden’s \(R^2\) vs the ordinary \(R^2\)

The following picture, Figure 5 from Chapter 5 of McFadden’s Urban Travel Demand: A Behavioral Analysis (1996), illustrates the typical relationship between the least squares \(R^2\) (on the \(y\)-axis) and the pseudo-\(R^2\) (on the \(x\)-axis).

McFadden’s comparison of the ordinary \(R^2\) and McFadden’s \(R^2\). Source: McFadden, D. Urban Travel Demand: A Behavioral Analysis (1996)

Variable selection in linear regression

Akaike’s information criterion (recap)

  • Akaike’s information criterion (AIC), defined as \[\text{AIC} = 2p - 2l,\] where \(p\) is the number of estimated parameters and \(l\) is the log-likelihood at the estimated parameters.
  • The smaller the AIC the better!
  • We add \(2p\) since a larger number of parameters allows for more noise. It penalizes model complexity.
  • Penalization of model complexity is always needed when comparing models with different numbers of parameters!
  • There is a theoretical argument for the penalty \(2p\), but that is beyond the scope of this course.
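The definition can be checked against statsmodels directly, assuming statsmodels counts the regression coefficients (intercept included) as the parameters for an OLS fit. A sketch on simulated data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data for illustration.
rng = np.random.default_rng(seed=1)
data = pd.DataFrame({"x": rng.normal(size=50)})
data["y"] = 1 + data["x"] + rng.normal(size=50)

fit = smf.ols("y ~ x", data=data).fit()

# AIC = 2p - 2l, with p the number of regression coefficients
# (intercept included) and l the maximized log-likelihood.
p = fit.df_model + 1
aic_manual = 2 * p - 2 * fit.llf
print(aic_manual, fit.aic)  # the two values should agree
```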

The adjusted \(R^2\)

  • The \(R^2\) is good for evaluating how well we can predict the outcome given our covariates, but it’s not good for choosing between models.
  • It doesn’t correct for the bias that occurs when using the same data both to estimate the model parameters and to evaluate model fit.
import numpy as np
rng = np.random.default_rng(seed = 313)
p = 10
n = 100
x = rng.normal(0, 1, (n, p)) # think about y ~ 1 + x1 + x2 + ... + x_p
y = rng.normal(3, 2, n) # True R^2 is 0!

  • The true \(R^2\) for y ~ x is \(0\) in this model. However, the expected value of the estimated \(R^2\) is not.
  • Can show that \[R^2 \sim \textrm{Beta}\left(\frac{k-1}{2},\frac{n-k}{2}\right),\] where \(k\) is the number of parameters including the intercept. (That is, \(k = p + 1\).)
  • Using the properties of the Beta distribution, we find that the expected value of \(R^2\) is \[E(R^2) = \frac{\frac{k-1}{2}}{\frac{k-1}{2} + \frac{n-k}{2}} = \frac{k-1}{n - 1}.\]
  • This number might be quite large when the number of predictors is close to the number of observations!
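A quick sanity check of the expectation formula: the mean of a \(\textrm{Beta}(a, b)\) distribution is \(a/(a+b)\), which with \(a = (k-1)/2\) and \(b = (n-k)/2\) reduces to \((k-1)/(n-1)\).

```python
# Mean of Beta(a, b) is a / (a + b). With a = (k - 1)/2 and b = (n - k)/2
# this reduces to (k - 1)/(n - 1). Check for k = 11, n = 100:
k, n = 11, 100
a, b = (k - 1) / 2, (n - k) / 2
print(a / (a + b))        # 0.10101...
print((k - 1) / (n - 1))  # 0.10101...
```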

import statsmodels.formula.api as smf
n_reps = 5000
yy = rng.normal(3, 2, (n_reps, n))

def func1d(y):
  fit = smf.ols("y ~ x", data = {"y": y, "x": x}).fit() # y ~ x1 + x2 + ... + x_p
  return fit.rsquared

rsqs = np.apply_along_axis(
    func1d = func1d,
    axis = 1,
    arr = yy
)

[np.mean(rsqs), p / (n-1)]
[0.10158243551749126, 0.10101010101010101]

Constructing the adjusted \(R^2\) (i)

  • Recall the definition of the \(R^2\): \[R^2 = 1 - \frac{\textrm{Sum of squares with predictors}}{\textrm{Sum of squares without predictors}}\]

  • We can show that \(\textrm{Sum of squares with predictors}\) is biased for its population value, the true sum of squares with predictors at our estimated regression coefficient. But we can correct for this bias!

  • One can show that \[\frac{n}{n-p-1}E(\textrm{Sum of squares with predictors})\] equals the true population sum of squares with predictors. Here \(p\) is the number of estimated regression coefficients, excluding the intercept.

Constructing the adjusted \(R^2\) (ii)

  • Moreover, we can show that \[\frac{n}{n-1} E(\textrm{Sum of squares with predictors})\]equals the true, population value of the sum of squares without predictors.
  • It follows that a reasonable corrected \(R^2\) is \[R_a^2 = 1 - \frac{\frac{n}{n-p-1} \textrm{Sum of squares with predictors}}{\frac{n}{n-1}\textrm{Sum of squares without predictors}}\]
  • Rearranging this, we find that \[R_a ^ 2 = 1 - (1 - R^2) \frac{n-1}{n-p-1} .\]
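The rearranged formula can be verified against the adjusted \(R^2\) that statsmodels reports (assuming statsmodels uses this standard definition). A sketch on simulated data with \(n = 50\) and \(p = 2\):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data for illustration: n = 50 observations, p = 2 predictors.
rng = np.random.default_rng(seed=313)
n, p = 50, 2
data = pd.DataFrame(rng.normal(size=(n, p)), columns=["x1", "x2"])
data["y"] = 1 + data["x1"] + rng.normal(size=n)

fit = smf.ols("y ~ x1 + x2", data=data).fit()
r2_adj = 1 - (1 - fit.rsquared) * (n - 1) / (n - p - 1)
print(r2_adj, fit.rsquared_adj)  # the two values should agree
```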

Notes on the adjusted \(R^2\)

  1. The adjusted \(R^2\), or \(R^2_a\), can be less than \(0\). Try to understand why.
  2. We haven’t proved that the adjusted \(R^2\) is unbiased for the true \(R^2\). Do you think it is?
  3. Can you devise a simulation study to explore this problem?

Comments

  • These methods can be used together with automatic mechanisms for selecting covariates, e.g. backwards and forwards regression.
  • The BIC (“Bayesian information criterion”) is similar to the AIC, but defined as \(k\log n - 2\log L\)! (The AIC is \(2k-2\log L\)).
  • There are many different methods too, but adjusted \(R^2\), ANOVA and AIC are most popular.

Model fit

Conditions for inference in regression

  1. The linear relationship is true, \(y_i=\beta^T x_i + \epsilon_i\).
  2. No autocorrelation. The errors \(\epsilon_i\) are independent of each other given \(x_i\). (This is important.)
  3. Normally distributed errors. The errors are normally distributed. (Not very important.)
  4. Homoskedastic errors. The variance of the errors should not depend on \(x_i\).

Residuals

  • Defined as \(y - \hat{y}\). It’s common to look at plots of residuals.
  • We often plot “residuals vs. fitted”, i.e. \(y - \hat{y}\) vs \(\hat{y}\)
  • The plots should be without patterns!

Residuals: Heteroskedastic errors

rng = np.random.default_rng(seed = 313)
x = rng.uniform(size=100)
y = 1 + x + rng.normal(size=100) * (x + 0.5) ** 2
mod = smf.ols("y ~ x", data = {"x": x, "y": y}).fit()
plt.scatter(mod.predict(), mod.resid)

  • The residuals are smaller in magnitude when the fitted values are small!

  • You can make such plots when dealing with multiple covariates too.

QQ plot of residuals

Source: Introduction to Linear Regression Analysis (Elizabeth Peck, Geoffrey Vining)

QQ plot: About \(n\)

  • Small samples (\(n<20\)) often don’t yield useful quantile-quantile plots!

  • But \(n>30\) usually suffices.

QQ plot: \(n = 100\), normal distribution

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
rng = np.random.default_rng(seed = 313)
x = rng.normal(size = 100)
sm.qqplot(x, line ='45')
plt.show()

QQ plot: \(n = 10\), normal distribution

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
rng = np.random.default_rng(seed = 3)
x = rng.normal(size = 10)
sm.qqplot(x, line ='45')
plt.show()

Looks like heavy tails!

QQ plot: Large \(n\), horrible distribution

rng = np.random.default_rng(seed = 3)
x = rng.standard_t(size = 50, df = 3)
sm.qqplot(x, line ='45')
plt.show()

You’re looking for seriously heavy tails!

QQ plot: Large \(n\), good distribution

rng = np.random.default_rng(seed = 3)
x = rng.uniform(size = 50, low = 0, high = 1)
sm.qqplot(x, line ='45')
plt.show()

Heteroskedasticity

import seaborn as sns
rng = np.random.default_rng(seed = 313)
x = rng.uniform(size=50)
y = 1 + x + rng.normal(size=50) * (x + 0.5) ** 2
data = pd.DataFrame({"x": x, "y": y})
sns.lmplot(data = data, x = "x", y = "y")

Robust standard errors using White (1980)

  • Inferential procedures such as confidence intervals and p-values are invalid under heteroskedasticity.
  • Use the argument cov_type='HC0' inside the fit method for robust standard errors.
smf.ols("y ~ x", data = data).fit().pvalues
Intercept    0.018321
x            0.035364
dtype: float64
smf.ols("y ~ x", data = data).fit(cov_type='HC0').pvalues
Intercept    0.002156
x            0.028677
dtype: float64

White, H. (1980). A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity. Econometrica, 48(4), 817–838. https://doi.org/10.2307/1912934

Robust regression

Telephone data

Number of international calls from Belgium, taken from the Belgian Statistical Survey, published by the Ministry of Economy.

import seaborn as sns
import rdatasets as rd
telef = rd.data("Robustbase", "telef")
sns.lmplot(data = telef, x = "Year", y = "Calls")

  • What’s happening here? Clerical errors! Such errors are abundant.

  • Two ways to fix such problems:

    • (a) Remove the “bad” data. But this is not always feasible.

    • (b) Use robust regression methods. Such methods are – sometimes – able to automatically ignore data that should be ignored.

Using rlm

robust = smf.rlm(
  "Calls ~ Year", 
  data = telef, 
  M = sm.robust.norms.HuberT()).fit()
sns.lmplot(data = telef, x = "Year", y = "Calls")
plt.plot(telef["Year"], robust.predict(), color = "red")

Duncan prestige

The Duncan data frame has 45 rows and 4 columns. Data on the prestige and other characteristics of 45 U. S. occupations in 1950.

prestige = sm.datasets.get_rdataset("Duncan", "carData", cache=True).data
prestige.head()
type income education prestige
accountant prof 62 86 82
pilot prof 72 76 83
architect prof 75 92 90
author prof 55 90 76
chemist prof 64 86 90

Is there an outlier here?

prestige = sm.datasets.get_rdataset("Duncan", "carData", cache=True).data
fig = plt.figure(figsize = (10, 7))
ax = plt.axes(projection ="3d")
ax.scatter3D(prestige.income, prestige.education, prestige.prestige, alpha=0.7)
plt.show()

Fitting the Duncan prestige data

prestige_model = smf.ols("prestige ~ income + education", data=prestige).fit()
print(prestige_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               prestige   R-squared:                       0.828
Model:                            OLS   Adj. R-squared:                  0.820
Method:                 Least Squares   F-statistic:                     101.2
Date:                Tue, 01 Nov 2022   Prob (F-statistic):           8.65e-17
Time:                        14:02:05   Log-Likelihood:                -178.98
No. Observations:                  45   AIC:                             364.0
Df Residuals:                      42   BIC:                             369.4
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -6.0647      4.272     -1.420      0.163     -14.686       2.556
income         0.5987      0.120      5.003      0.000       0.357       0.840
education      0.5458      0.098      5.555      0.000       0.348       0.744
==============================================================================
Omnibus:                        1.279   Durbin-Watson:                   1.458
Prob(Omnibus):                  0.528   Jarque-Bera (JB):                0.520
Skew:                           0.155   Prob(JB):                        0.771
Kurtosis:                       3.426   Cond. No.                         163.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Residual plot for Duncan prestige data

plt.scatter(prestige_model.predict(), prestige_model.resid)
plt.show()

Robust regression

robust = smf.rlm(
  "prestige ~ income + education", 
  data = prestige, 
  M = sm.robust.norms.HuberT()).fit()
print(robust.summary())
                    Robust linear Model Regression Results                    
==============================================================================
Dep. Variable:               prestige   No. Observations:                   45
Model:                            RLM   Df Residuals:                       42
Method:                          IRLS   Df Model:                            2
Norm:                          HuberT                                         
Scale Est.:                       mad                                         
Cov Type:                          H1                                         
Date:                Tue, 01 Nov 2022                                         
Time:                        14:02:06                                         
No. Iterations:                    18                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -7.1107      3.879     -1.833      0.067     -14.713       0.492
income         0.7015      0.109      6.456      0.000       0.489       0.914
education      0.4854      0.089      5.441      0.000       0.311       0.660
==============================================================================


Summary (i)

  1. The ROC displays the relationship between the true positive rate and the false positive rate.
  2. The AUC measures how good your model is at predicting, with \(0.5\) being no better than chance and \(1\) being perfect. The AUC should not be used to compare models. Use the AIC for that.
  3. McFadden’s \(R^2\) is a generalization of the \(R^2\) to binary models.

Summary (ii)

  1. You can handle outliers and clerical errors using robust regression models.
  2. Residual plots can be used to find out if the residuals are heteroskedastic. If the residuals are heteroskedastic, you may use White (1980) corrected standard errors.
  3. Quantile-quantile plots are used to check if residuals are normal. Be on the lookout for heavy tails.
  4. The adjusted \(R^2\) can be used for model selection.